1. Import and EDA

1.1. Import required library

tidyverse library ver. 1.3.1 is used for processing data.

This library include all required libraries:
- ggplot2 3.3.5, for data visualization.
- dplyr 1.0.8, for data manipulation.
- tidyr 1.2.0, for data tidying.
- readr 2.1.2, for data import.
- purrr 0.3.4, for functional programming.

as well as tibble ver. 3.1.6, stringr ver 1.4.0, ver. forcats 0.5.1.

Also readxl used for read .xls files.

For data processing and analysis car, caTools, caret, lmtest, multcomp, vegan, vegan3d, plot3D, scatterplot3d, gridExtra, and limma were used.

1.2. Import .xls data

We load data from all .xls into one variable.

.xls files with raw data should be placed in ./data/ folder.

The data:

1.3. Data description

There are 1080 rows and 82 variables in the data.

There are 72 mice in experiment. All mouses belongs to one of 8 classes:

Classes: - c-CS-s: control mice, stimulated to learn, injected with saline - c-CS-m: control mice, stimulated to learn, injected with memantine - c-SC-s: control mice, not stimulated to learn, injected with saline - c-SC-m: control mice, not stimulated to learn, injected with memantine

  • t-CS-s: trisomy mice, stimulated to learn, injected with saline
  • t-CS-m: trisomy mice, stimulated to learn, injected with memantine
  • t-SC-s: trisomy mice, not stimulated to learn, injected with saline
  • t-SC-m: trisomy mice, not stimulated to learn, injected with memantine

There are mouses with 2 genotypes and 2 behavior under 2 types of treatment in dataset:

If we drop rows with NA in any column:

When deleting rows with missing values, the data is not balanced across groups. In two cases, the control group is twice as large as the group of trisomic mice. When counted with rows with missing values, the data is more or less balanced.

The number of complete observations is 528 (there are 1080 observations in total).

Violin and box-plots for each protein (after standardization) are in ..\data\plots\ folder.

2. Are there differences in the level of BDNF_N production depending on the class in the experiment?

Delete Outliers using by Interquartile Range Rule and NA in each class.

We have a sufficient number of observations to consider the distribution of BDNF_N variable as a normal distribution, which can be seen on the graph:

We fit a linear regression model to the data and perform a Shapiro-Wilk statistical test to test the residuals for normality:

## 
## Call:
## lm(formula = BDNF_N ~ class, data = mouse_wo_outl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13623 -0.02822 -0.00125  0.02866  0.13426 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33834    0.00370   91.43  < 2e-16 ***
## classc-CS-s  0.00282    0.00538    0.52   0.6001    
## classc-SC-m -0.04621    0.00523   -8.83  < 2e-16 ***
## classc-SC-s -0.02418    0.00538   -4.50  7.7e-06 ***
## classt-CS-m -0.02676    0.00538   -4.98  7.6e-07 ***
## classt-CS-s -0.03712    0.00580   -6.39  2.4e-10 ***
## classt-SC-m -0.01728    0.00537   -3.22   0.0013 ** 
## classt-SC-s -0.01275    0.00540   -2.36   0.0184 *  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0452 on 1061 degrees of freedom
## Multiple R-squared:  0.114,  Adjusted R-squared:  0.108 
## F-statistic: 19.5 on 7 and 1061 DF,  p-value: <2e-16

The residuals of the linear regression model is normally distributed. Shapiro-Wilk normality test p-value = 0.044 (number of observations is 1069, W statistics is 0.997).

We accept it as possible to carry out one-way analysis of variance (one-way ANOVA).

anova_test_BDNF <- Anova(mod_BDNF)

The levels of BDNF_N production depend on the class in the experiment (F = 19.465375, p_value = 1.29894^{-24} , df_1 = 7, df_2 = 1061).

2.1. Perform the post-hoc (TUKEYS HSD)

To identify pairwise differences, the Tukey post-hock test was performed (df = 1061).

There are statistically significant differences in the levels of BDNF_N production depend on the class in the experiment when pairwise comparison of the production of various class:

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = BDNF_N ~ class, data = mouse_wo_outl)
## 
## Linear Hypotheses:
##                      Estimate Std. Error t value Pr(>|t|)
## c-CS-s - c-CS-m == 0  0.00282    0.00538    0.52   0.9995
## c-SC-m - c-CS-m == 0 -0.04621    0.00523   -8.83   <0.001
## c-SC-s - c-CS-m == 0 -0.02418    0.00538   -4.50   <0.001
## t-CS-m - c-CS-m == 0 -0.02676    0.00538   -4.98   <0.001
## t-CS-s - c-CS-m == 0 -0.03712    0.00580   -6.39   <0.001
## t-SC-m - c-CS-m == 0 -0.01728    0.00537   -3.22   0.0284
## t-SC-s - c-CS-m == 0 -0.01275    0.00540   -2.36   0.2605
## c-SC-m - c-CS-s == 0 -0.04903    0.00538   -9.12   <0.001
## c-SC-s - c-CS-s == 0 -0.02700    0.00552   -4.89   <0.001
## t-CS-m - c-CS-s == 0 -0.02958    0.00552   -5.36   <0.001
## t-CS-s - c-CS-s == 0 -0.03994    0.00594   -6.73   <0.001
## t-SC-m - c-CS-s == 0 -0.02010    0.00551   -3.65   0.0067
## t-SC-s - c-CS-s == 0 -0.01557    0.00554   -2.81   0.0933
## c-SC-s - c-SC-m == 0  0.02203    0.00538    4.10   0.0012
## t-CS-m - c-SC-m == 0  0.01946    0.00538    3.62   0.0075
## t-CS-s - c-SC-m == 0  0.00910    0.00580    1.57   0.7694
## t-SC-m - c-SC-m == 0  0.02894    0.00537    5.39   <0.001
## t-SC-s - c-SC-m == 0  0.03346    0.00540    6.20   <0.001
## t-CS-m - c-SC-s == 0 -0.00258    0.00552   -0.47   0.9998
## t-CS-s - c-SC-s == 0 -0.01294    0.00594   -2.18   0.3647
## t-SC-m - c-SC-s == 0  0.00690    0.00551    1.25   0.9152
## t-SC-s - c-SC-s == 0  0.01143    0.00554    2.06   0.4393
## t-CS-s - t-CS-m == 0 -0.01036    0.00594   -1.75   0.6561
## t-SC-m - t-CS-m == 0  0.00948    0.00551    1.72   0.6728
## t-SC-s - t-CS-m == 0  0.01400    0.00554    2.53   0.1846
## t-SC-m - t-CS-s == 0  0.01984    0.00593    3.35   0.0189
## t-SC-s - t-CS-s == 0  0.02437    0.00595    4.09   0.0012
## t-SC-s - t-SC-m == 0  0.00452    0.00553    0.82   0.9921
##                         
## c-CS-s - c-CS-m == 0    
## c-SC-m - c-CS-m == 0 ***
## c-SC-s - c-CS-m == 0 ***
## t-CS-m - c-CS-m == 0 ***
## t-CS-s - c-CS-m == 0 ***
## t-SC-m - c-CS-m == 0 *  
## t-SC-s - c-CS-m == 0    
## c-SC-m - c-CS-s == 0 ***
## c-SC-s - c-CS-s == 0 ***
## t-CS-m - c-CS-s == 0 ***
## t-CS-s - c-CS-s == 0 ***
## t-SC-m - c-CS-s == 0 ** 
## t-SC-s - c-CS-s == 0 .  
## c-SC-s - c-SC-m == 0 ** 
## t-CS-m - c-SC-m == 0 ** 
## t-CS-s - c-SC-m == 0    
## t-SC-m - c-SC-m == 0 ***
## t-SC-s - c-SC-m == 0 ***
## t-CS-m - c-SC-s == 0    
## t-CS-s - c-SC-s == 0    
## t-SC-m - c-SC-s == 0    
## t-SC-s - c-SC-s == 0    
## t-CS-s - t-CS-m == 0    
## t-SC-m - t-CS-m == 0    
## t-SC-s - t-CS-m == 0    
## t-SC-m - t-CS-s == 0 *  
## t-SC-s - t-CS-s == 0 ** 
## t-SC-s - t-SC-m == 0    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Groups between which there was a statistically significant difference in BDNF_N protein production were marked with one *, two ** and three ***:

  • the level of BDNF_N production in the experimental group c-CS-m is statistically significantly higher than the level of BDNF_N production in the
    • c-SC-m group (t-statistic is -8.831039, p-value is 1.110223^{-16}),
    • c-SC-s group (t-statistic is -4.496425, p-value is 0.000212),
    • t-CS-m group (t-statistic is -4.975456, p-value is 2.050985^{-5}),
    • t-CS-s group (t-statistic is -6.394443, p-value is 3.377129^{-8}),
    • t-SC-m group (t-statistic is -3.218881, p-value is 0.028381).
  • the level of BDNF_N production in the experimental group c-CS-s is statistically significantly higher than the level of BDNF_N production in the
    • c-SC-m group (t-statistic is -9.1183, p-value is 0),
    • c-SC-s group (t-statistic is -4.892914, p-value is 2.941933^{-5}),
    • t-CS-m group (t-statistic is -5.359733, p-value is 2.994139^{-6}),
    • t-CS-s group (t-statistic is -6.728985, p-value is 2.020564^{-10}),
    • t-SC-m group (t-statistic is -3.648578, p-value is 0.006706).
  • the level of BDNF_N production in the experimental group c-SC-m is statistically significantly lower than the level of BDNF_N production in the
    • c-SC-s group (t-statistic is 4.09739, p-value is 0.001212),
    • t-CS-m group (t-statistic is 3.618359, p-value is 0.007494),
    • t-SC-m group (t-statistic is 5.391741, p-value is 4.201719^{-6}),
    • t-SC-s group (t-statistic is 6.197646, p-value is 8.693807^{-9}).
  • the level of BDNF_N production in the experimental group t-CS-s is statistically significantly lower than the level of BDNF_N production in the
    • t-SC-m group (t-statistic is 3.348428, p-value is 0.018877)
    • t-SC-s group (t-statistic is 4.091753, p-value is 0.001167).

3. Linear model for ERBB4_N

Delete rows with many NA:

Count NA’s in each column and return column names which contains NA’s (with a number of Na’s):

##      ELK_N      MEK_N Bcatenin_N      BAD_N     BCL2_N 
##         15          4         15        213        285 
##    pCFOS_N  H3AcK18_N     EGR1_N   H3MeK4_N 
##         75        180        210        270

Remove columns with more than 100 NA’s and after this - rows with NA.

There are 972 rows and 77 variables in the new data.

3.1. Divide the data into training and test samples and Normalize

3.2. Linear model

## 
## Call:
## lm(formula = ERBB4_N ~ ., data = norm.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6162 -0.2644 -0.0043  0.2665  1.6251 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.316564   0.191278    1.65  0.09840 .  
## DYRK1A_N        -0.739740   0.445301   -1.66  0.09715 .  
## ITSN1_N          0.431635   0.128979    3.35  0.00086 ***
## BDNF_N          -0.006208   0.056456   -0.11  0.91247    
## NR1_N           -0.283760   0.109536   -2.59  0.00979 ** 
## NR2A_N          -0.082195   0.074829   -1.10  0.27242    
## pAKT_N           0.196285   0.067807    2.89  0.00392 ** 
## pBRAF_N         -0.073160   0.049169   -1.49  0.13725    
## pCAMKII_N       -0.035047   0.049005   -0.72  0.47475    
## pCREB_N         -0.126384   0.055048   -2.30  0.02200 *  
## pELK_N          -0.035600   0.037029   -0.96  0.33669    
## pERK_N          -0.034678   0.069986   -0.50  0.62041    
## pJNK_N          -0.137350   0.069811   -1.97  0.04955 *  
## PKCA_N           0.282964   0.071014    3.98  7.5e-05 ***
## pMEK_N           0.022133   0.064542    0.34  0.73176    
## pNR1_N          -0.141583   0.086192   -1.64  0.10093    
## pNR2A_N          0.091970   0.066353    1.39  0.16619    
## pNR2B_N          0.133880   0.086790    1.54  0.12341    
## pPKCAB_N         0.011033   0.076577    0.14  0.88548    
## pRSK_N           0.048000   0.049893    0.96  0.33637    
## AKT_N            0.086348   0.061566    1.40  0.16123    
## BRAF_N          -0.259251   0.126212   -2.05  0.04036 *  
## CAMKII_N         0.054620   0.063491    0.86  0.38995    
## CREB_N          -0.069386   0.046194   -1.50  0.13356    
## ELK_N            0.036247   0.079986    0.45  0.65058    
## ERK_N            0.248546   0.092547    2.69  0.00742 ** 
## GSK3B_N          0.000181   0.086711    0.00  0.99833    
## JNK_N            0.053428   0.061683    0.87  0.38671    
## MEK_N           -0.039407   0.053434   -0.74  0.46109    
## TRKA_N          -0.058324   0.094674   -0.62  0.53808    
## RSK_N            0.076056   0.064033    1.19  0.23536    
## APP_N           -0.046249   0.044811   -1.03  0.30240    
## Bcatenin_N       0.105490   0.111162    0.95  0.34298    
## SOD1_N          -0.053581   0.045364   -1.18  0.23798    
## MTOR_N           0.178843   0.061983    2.89  0.00404 ** 
## P38_N           -0.021191   0.067464   -0.31  0.75354    
## pMTOR_N         -0.101488   0.059056   -1.72  0.08617 .  
## DSCR1_N         -0.037512   0.057192   -0.66  0.51212    
## AMPKA_N         -0.049791   0.081247   -0.61  0.54020    
## NR2B_N           0.002035   0.048894    0.04  0.96681    
## pNUMB_N         -0.037937   0.064028   -0.59  0.55371    
## RAPTOR_N         0.066184   0.072824    0.91  0.36378    
## TIAM1_N          0.006683   0.068417    0.10  0.92221    
## pP70S6_N         0.009730   0.049914    0.19  0.84550    
## NUMB_N          -0.109550   0.059067   -1.85  0.06409 .  
## P70S6_N         -0.057986   0.050614   -1.15  0.25235    
## pGSK3B_N         0.041136   0.047073    0.87  0.38250    
## pPKCG_N         -0.335286   0.063631   -5.27  1.9e-07 ***
## CDK5_N           0.013850   0.026755    0.52  0.60487    
## S6_N             0.065872   0.051731    1.27  0.20335    
## ADARB1_N        -0.040814   0.039372   -1.04  0.30029    
## AcetylH3K9_N     0.135841   0.058159    2.34  0.01981 *  
## RRP1_N          -0.037075   0.023156   -1.60  0.10983    
## BAX_N            0.014668   0.037831    0.39  0.69834    
## ARC_N            0.167255   0.049050    3.41  0.00069 ***
## nNOS_N          -0.031950   0.037210   -0.86  0.39084    
## Tau_N            0.185728   0.062368    2.98  0.00301 ** 
## GFAP_N          -0.041150   0.035860   -1.15  0.25159    
## GluR3_N         -0.062585   0.035806   -1.75  0.08095 .  
## GluR4_N         -0.035215   0.023093   -1.52  0.12775    
## IL1B_N           0.075211   0.050746    1.48  0.13879    
## P3525_N          0.193371   0.038600    5.01  7.0e-07 ***
## pCASP9_N         0.170040   0.045854    3.71  0.00023 ***
## PSD95_N          0.391634   0.047308    8.28  7.0e-16 ***
## SNCA_N           0.009827   0.047337    0.21  0.83561    
## Ubiquitin_N     -0.042071   0.049402   -0.85  0.39475    
## pGSK3B_Tyr216_N  0.180801   0.052661    3.43  0.00063 ***
## SHH_N           -0.000416   0.031477   -0.01  0.98946    
## pS6_N                  NA         NA      NA       NA    
## pCFOS_N         -0.038880   0.036897   -1.05  0.29239    
## SYP_N            0.079747   0.034568    2.31  0.02137 *  
## CaNA_N          -0.115093   0.068908   -1.67  0.09535 .  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.446 on 658 degrees of freedom
## Multiple R-squared:  0.82,   Adjusted R-squared:  0.801 
## F-statistic: 42.8 on 70 and 658 DF,  p-value: <2e-16

Residual standard error: 0.424 on 658 degrees of freedom

Multiple R-squared: 0.819925, Adjusted R-squared: 0.800768.

F-statistic: 43 on 70 and 658 DF, p-value: <2e-16.

3.3. Inspect Linear model

Residuals vs Fitted plot - Linear relationship, that is good.

Normal Q-Q - the tails are observed to be ‘heavier’ (have larger values) than what we would expect under the standard modeling assumptions. It`s OK.

Scale-Location plot - let’s do Breusch Pagan Test to check for heteroskedasticity more formally:

Result: as expected, there is heteroskedasticity.

Residuals vs Leverage (observations whose standardized residuals are greater than 3 in absolute value are possible outliers) - there are few influential observations, but no outliers.

3.4. Prediction?

pred_y_lm <-  predict(model, dplyr::select(norm.test, -ERBB4_N))
## Warning in predict.lm(model, dplyr::select(norm.test,
## -ERBB4_N)): prediction from a rank-deficient fit may be
## misleading

Warning can tell us us about multi-collinearity in data. It can lead to a rank deficient matrix in logistic regression. We can try applying PCA to tackle the multi-collinearity issue and then apply logistic regression afterwards.

MAE = 0.309649. The model contains too many predictors and some problems, it might be better.

3.5. Linear model conclusion

The model contains too many predictors, heteroskedasticity, and also multi-collinearity in data so it might be better.

4. PCA

4.1. Run PCA on all train set

Run principal components analysis on all train set.

There are 70 principal components.

Let’s analyze the results using the eigenvalue plot:

You can see that there are maybe enough 7 principal components.

The Cumulative Proportion from summary:

90% of the variability can be explained by 17 main components. Let’s try to apply 17 main components to building a linear model.

4.2. Getting Principal Component Scores and Data Transformation

Getting Principal Component Scores and Train data transformation.

Test data transformation.

4.3. Plot 3D plot (3 principal components)

4.4. New linear regression after PCA

## 
## Call:
## lm(formula = ERBB4_N ~ ., data = pca_train_17)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8789 -0.3003 -0.0137  0.3078  1.4854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01250    0.01832    0.68  0.49536    
## PC1         -0.07795    0.00414  -18.82  < 2e-16 ***
## PC2         -0.06810    0.00543  -12.55  < 2e-16 ***
## PC3          0.20823    0.00672   31.00  < 2e-16 ***
## PC4         -0.13707    0.00786  -17.44  < 2e-16 ***
## PC5          0.05942    0.00992    5.99  3.3e-09 ***
## PC6          0.09799    0.01032    9.49  < 2e-16 ***
## PC7         -0.04772    0.01187   -4.02  6.4e-05 ***
## PC8          0.00478    0.01285    0.37  0.71021    
## PC9          0.15376    0.01439   10.69  < 2e-16 ***
## PC10         0.12291    0.01626    7.56  1.3e-13 ***
## PC11        -0.07667    0.01751   -4.38  1.4e-05 ***
## PC12         0.21888    0.01904   11.49  < 2e-16 ***
## PC13        -0.07218    0.02072   -3.48  0.00053 ***
## PC14        -0.04633    0.02093   -2.21  0.02715 *  
## PC15         0.00506    0.02224    0.23  0.82018    
## PC16        -0.10543    0.02368   -4.45  9.8e-06 ***
## PC17         0.09713    0.02491    3.90  0.00011 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.491 on 711 degrees of freedom
## Multiple R-squared:  0.765,  Adjusted R-squared:  0.759 
## F-statistic:  136 on 17 and 711 DF,  p-value: <2e-16

R-squared: 0.76497, Adjusted R-squared: 0.75935.

(without PCA it was multiple R-squared: 0.819925, adjusted R-squared: 0.800768.)

MAE = 1.021314.

According to the R-squared the model became a bit worse, and the MAE has grown too.

4.5. Inspect New Linear model

Residuals vs Fitted plot - Linear relationship, that is good.

Normal Q-Q - the tails are observed to be ‘heavier’ (have larger values) than what we would expect under the standard modeling assumptions. It`s OK.

Scale-Location plot - let’s do Breusch Pagan Test to check for heteroskedasticity more formally:

Result: there is heteroskedasticity.

Residuals vs Leverage (observations whose standardized residuals are greater than 3 in absolute value are possible outliers) - there are few influential observations, but no outliers.

5. Search for differential proteins Using limma

Expression levels for 72 proteins were assessed.

5.1. Behavior

Let’s determine how many proteins changed their expression during “stimulation to learning”. SC was a baseline.

##   Down NotSig     Up 
##     40      9     23

When mouse are stimulated to learn, 40 proteins are downregulated, and 23 are upregulated.

Up-regulated proteins:

##  [1] "DYRK1A_N"        "ITSN1_N"         "BDNF_N"         
##  [4] "NR2A_N"          "pELK_N"          "pERK_N"         
##  [7] "PKCA_N"          "pPKCAB_N"        "BRAF_N"         
## [10] "ELK_N"           "ERK_N"           "GSK3B_N"        
## [13] "TRKA_N"          "APP_N"           "pNUMB_N"        
## [16] "NUMB_N"          "pGSK3B_N"        "CDK5_N"         
## [19] "S6_N"            "ADARB1_N"        "GFAP_N"         
## [22] "pGSK3B_Tyr216_N" "CaNA_N"

Down-regulated proteins:

##  [1] "pAKT_N"       "pBRAF_N"      "pCAMKII_N"   
##  [4] "pCREB_N"      "pJNK_N"       "pMEK_N"      
##  [7] "pNR1_N"       "pNR2A_N"      "pNR2B_N"     
## [10] "AKT_N"        "CAMKII_N"     "CREB_N"      
## [13] "MEK_N"        "RSK_N"        "Bcatenin_N"  
## [16] "SOD1_N"       "MTOR_N"       "P38_N"       
## [19] "pMTOR_N"      "DSCR1_N"      "AMPKA_N"     
## [22] "NR2B_N"       "RAPTOR_N"     "TIAM1_N"     
## [25] "pP70S6_N"     "AcetylH3K9_N" "RRP1_N"      
## [28] "ARC_N"        "nNOS_N"       "Tau_N"       
## [31] "IL1B_N"       "pCASP9_N"     "PSD95_N"     
## [34] "SNCA_N"       "Ubiquitin_N"  "SHH_N"       
## [37] "pS6_N"        "pCFOS_N"      "SYP_N"       
## [40] "ERBB4_N"

5.2. Treatment

Let’s determine how many proteins changed their expression when mice received memantine. Saline was a baseline.

##   Down NotSig     Up 
##     21     28     23

When mouse are received memantine, 21 proteins are down-regulated, and 23 are up-regulated.

Up-regulated proteins:

##  [1] "pAKT_N"      "pBRAF_N"     "pCAMKII_N"   "pJNK_N"     
##  [5] "pMEK_N"      "pNR2A_N"     "CREB_N"      "MTOR_N"     
##  [9] "P38_N"       "pMTOR_N"     "DSCR1_N"     "NR2B_N"     
## [13] "RAPTOR_N"    "ARC_N"       "nNOS_N"      "GluR3_N"    
## [17] "IL1B_N"      "pCASP9_N"    "PSD95_N"     "SNCA_N"     
## [21] "Ubiquitin_N" "pS6_N"       "ERBB4_N"

Down-regulated proteins:

##  [1] "ITSN1_N"    "BDNF_N"     "NR1_N"      "pPKCAB_N"  
##  [5] "AKT_N"      "ELK_N"      "ERK_N"      "Bcatenin_N"
##  [9] "SOD1_N"     "NUMB_N"     "pGSK3B_N"   "pPKCG_N"   
## [13] "CDK5_N"     "S6_N"       "ADARB1_N"   "RRP1_N"    
## [17] "BAX_N"      "Tau_N"      "GFAP_N"     "P3525_N"   
## [21] "CaNA_N"

5.3. Genotype

Let’s determine how many proteins changed their expression for Ts65Dn mouses. Control was a baseline.

##   Down NotSig     Up 
##     18     34     20

When mouse have Ts65Dn genotype, 18 proteins are down-regulated, and 20 are up-regulated.

Up-regulated proteins:

##  [1] "DYRK1A_N"        "ITSN1_N"         "pCREB_N"        
##  [4] "pRSK_N"          "TRKA_N"          "APP_N"          
##  [7] "TIAM1_N"         "pP70S6_N"        "NUMB_N"         
## [10] "pGSK3B_N"        "pPKCG_N"         "CDK5_N"         
## [13] "S6_N"            "AcetylH3K9_N"    "RRP1_N"         
## [16] "BAX_N"           "Tau_N"           "GFAP_N"         
## [19] "P3525_N"         "pGSK3B_Tyr216_N"

Down-regulated proteins:

##  [1] "NR1_N"   "NR2A_N"  "pNR1_N"  "pNR2A_N" "pNR2B_N"
##  [6] "AKT_N"   "MTOR_N"  "P38_N"   "pMTOR_N" "AMPKA_N"
## [11] "NR2B_N"  "ARC_N"   "GluR3_N" "IL1B_N"  "SNCA_N" 
## [16] "pS6_N"   "pCFOS_N" "SYP_N"

5.4. Class

Let’s determine how class affected protein expression levels. c-SC-s was a baseline.

Number of up- and -down-regulated proteins are in the table.

6. Summary

  1. There are 1080 rows and 82 variables in the raw data.

  2. There are 72 mice in experiment. All mouses belongs to one of 8 classes.

  3. There are mouses with 2 genotypes and 2 behavior under 2 types of treatment in dataset.

  4. The number of complete observations is 528 (there are 1080 observations in total).

  5. Violin and box-plots for each protein (after standardization) are in ..\data\plots\ folder.

  6. The levels of BDNF_N production depend on the class in the experiment (F = 19.465375, p_value = 1.29894^{-24} , df_1 = 7, df_2 = 1061).

  7. There are statistically significant differences in the levels of BDNF_N production depend on the class in the experiment when pairwise comparison of the production of various class (Tukey post-hock test).

Groups between which there was a statistically significant difference in BDNF_N protein production:

  • the level of BDNF_N production in the experimental group c-CS-m is statistically significantly higher than the level of BDNF_N production in the
    • c-SC-m group (t-statistic is -8.831039, p-value is 1.110223^{-16}),
    • c-SC-s group (t-statistic is -4.496425, p-value is 0.000212),
    • t-CS-m group (t-statistic is -4.975456, p-value is 2.050985^{-5}),
    • t-CS-s group (t-statistic is -6.394443, p-value is 3.377129^{-8}),
    • t-SC-m group (t-statistic is -3.218881, p-value is 0.028381).
  • the level of BDNF_N production in the experimental group c-CS-s is statistically significantly higher than the level of BDNF_N production in the
    • c-SC-m group (t-statistic is -9.1183, p-value is 0),
    • c-SC-s group (t-statistic is -4.892914, p-value is 2.941933^{-5}),
    • t-CS-m group (t-statistic is -5.359733, p-value is 2.994139^{-6}),
    • t-CS-s group (t-statistic is -6.728985, p-value is 2.020564^{-10}),
    • t-SC-m group (t-statistic is -3.648578, p-value is 0.006706).
  • the level of BDNF_N production in the experimental group c-SC-m is statistically significantly lower than the level of BDNF_N production in the
    • c-SC-s group (t-statistic is 4.09739, p-value is 0.001212),
    • t-CS-m group (t-statistic is 3.618359, p-value is 0.007494),
    • t-SC-m group (t-statistic is 5.391741, p-value is 4.201719^{-6}),
    • t-SC-s group (t-statistic is 6.197646, p-value is 8.693807^{-9}).
  • the level of BDNF_N production in the experimental group t-CS-s is statistically significantly lower than the level of BDNF_N production in the
    • t-SC-m group (t-statistic is 3.348428, p-value is 0.018877)
    • t-SC-s group (t-statistic is 4.091753, p-value is 0.001167).
  1. Linear model for ERBB4_N are built. re are 972 rows and 77 variables in the new data without NA.
    Residual standard error: 0.424 on 658 degrees of freedom.
    Multiple R-squared: 0.819925, Adjusted R-squared: 0.800768.
    F-statistic: 43 on 70 and 658 DF, p-value: <2e-16.
    MAE = 0.309649. The model contains too many predictors and some problems, it might be better.
    The model contains too many predictors, heteroskedasticity, and also multi-collinearity in data so it might be better.

  2. We made PCA for Linear model. There are 70 principal component, we used 17 (90% of the variability can be explained by 17 main components).
    R-squared: 0.76497, Adjusted R-squared: 0.75935.
    (without PCA it was multiple R-squared: 0.819925, adjusted R-squared: 0.800768.)
    MAE = 1.021314.
    According to the R-squared the model became a bit worse, and the MAE has grown too.

  3. Search for differential proteins Using limma - Expression levels for 72 proteins were assessed.

  • When mouse are stimulated to learn, 40 proteins are downregulated, and 23 are upregulated.
  • When mouse are received memantine, 21 proteins are down-regulated, and 23 are up-regulated.
  • When mouse have Ts65Dn genotype, 18 proteins are down-regulated, and 20 are up-regulated.